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ABSTRACT 


A recurring pattern of access to existing databases, data analyses, formulation of new hypotheses, use of 
an experimental design, institutional review board approvals, data collection, curation, and storage within 
trusted digital repositories is observable during clinical research work. The workflows that support the 
repeated nature of these activities can be ascribed as a Canonical Workflow Framework for Research (CWFR). 
Disease area clinical research is protocol specific, and during data collection, the electronic case report 
forms can use Common Data Elements (CDEs) that have precisely defined questions and are associated with 
the specified value(s) as responses. The CDE-based CWFR is integrated with a biomedical research informatics 
computing system, which consists of a complete stack of technical layers including the Protocol and Form 
Research Management System. The unique data dictionaries associated with the CWFR for Traumatic Brain 
Injury and Parkinson’s Disease resulted in the development of the Federal Interagency Traumatic Brain Injury 
and Parkinson's Disease Biomarker systems. Due to a canonical workflow, these two systems can use similar 
tools, applications, and service modules to create findable, accessible, interoperable, and reusable Digital 
Objects. The Digital Objects for Traumatic Brain Injury and Parkinson’s disease contain all relevant information 
needed from the time data is collected, validated, and maintained within a Storage Repository for future 
access. All Traumatic Brain Injury and Parkinson’s Disease studies can be shared as Research Objects that 
can be produced by aggregating related resources as information packages and is findable on the Internet by 
using unique identifiers. Overall, the integration of CWFR with an informatics system has resulted in the reuse 
of software applications for several National Institutes of Health-supported biomedical research programs. 
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1. INTRODUCTION 


Progress and discoveries in disease area(s) research are possible by promoting the exploration of hypotheses 
and designing experiments to investigate fundamental biomedical questions that can result in new treatments 
for patients. A wide range of biomedical platforms with an infrastructure comprising of hardware, software, 
application, and services can be used to manage research data. The workflows increasingly require the 
use of high-performance computing for collecting, annotating, curating, analyzing, and storing data within 
repositories [1]. 


Despite the diversity and wide-ranging research work on various diseases, a recurring pattern is identifiable. 
This pattern includes access to existing disease-specific information, analyses of available data, generation 
of hypotheses, use of an experimental design, review by an institutional review board, and the collection, 
curation, storage, and preservation of research data within repositories. 


A recent white paper has conceptualized the patterns observed within laboratory research as a Canonical 
Workflow Framework for Research (CWFR) [2]. The elements that constitute a CWFR are the presence 
of recurring steps in research activities, and the capability to form canonical information packages and 
libraries. These recurring steps (sequential and/or iterative steps) can result in data and metadata collection, 
and storage of information packages within repositories. 


Conceptually, Hardisty and Wittenberg envisioned the CWER as the topmost layer of a framework stack 
(in their white paper). The CWFR in the framework stack is the layer that researchers primarily interact with 
and contribute to development and standardization. 


1.1 Background 


The United States National Institutes of Health (NIH) mission is to seek fundamental knowledge about 
the nature and behavior of living systems, and the application of that knowledge to enhance health, lengthen 
life, and reduce illness and disability [3]. It conducts and funds focused research on many diseases, including 
heart, lung, blood, diabetes, digestive, kidney, musculoskeletal, neurological, cancer, and several others. 


At the NIH, the National Institute of Neurological Disorders and Stroke (NINDS) funds and supports 
research programs and activities on many neurological disorders, including Traumatic Brain Injury (TBI) [4] 
and Parkinson's Diseases (PD) [5]. The NINDS supports investigator-driven research for both TBI and PD, 
and data is collected from multiple grant-supported research activities. 


TBI can result from damage to the brain from external forces and is classified as mild, moderate, or 
severe. Severe TBI can become life-threatening, and long-term sequelae effect of TBI includes cognitive and 
physical disability, and post-concussion syndrome, which could lead to chronic traumatic encephalopathy. 


Parkinson’s Disease results from motor system disorders that result in uncontrollable movements of the 
body with symptoms that can include tremors, muscle stiffness, bradykinesia (slowing of spontaneous and 
automatic movement that can make it difficult to perform simple tasks or rapidly perform routine movements), 
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and postural instability. The exact cause of PD is an area of intense interest; factors like hereditary and a 
combination of genetic and environmental factors are believed to cause the onset of PD. 


A lack of standard data collection methods remains a challenge for disease area research, particularly 
for TBI. However, efforts are underway by NIH, other US government agencies, and research organizations 
to address the importance of using common standards, especially for data collection. The TBI research 
community proposed the use of Common Data Elements during data collection for TBI research [6]. 


Disease area clinical research is protocol specific and involves the collection of data on specific variables 
that enable the analyses to prove or disprove research hypotheses. These variables can include data elements 
or questions that can be aggregated into Case Report Forms (CRFs), which are completed for each patient 
enrolled in a clinical study. Implementing data elements and electronic CRFs reduces study implementation 
time and facilitates aggregated data analysis from different sources - including different study sites [7]. 
Research protocols use commonly defined questions ( i.e., variables), CDEs as part of the CRF data collection 
process [8]. These questions can be associated with specified values and used in a questionnaire or CRF 
during clinical studies. 


The concept of CDEs is independent of a disease area; however, the use of CDEs for a specific disease 
depends on consensus and adoption by that research community. The advantage of using CDEs is that it 
standardizes data collection, improves data quality, facilitates data sharing, and provides opportunities for 
meta-analysis and comparison of results from different studies [9]. 


The NINDS in collaboration with the disease area research community developed a set of general CDEs 
that can be used in a variety of clinical studies. The NINDS CDE Project is a collaboration between NINDS 
and the neurological disorder research community, and provides information on using CDEs, and has also 
developed CRF templates tailored to research for neurological diseases and disorders [10]. Within the CDE 
project, a small set of data elements is defined as the Core CDEs and are relevant to all TBI clinical studies 
and research. For four study types, concussion/mild TBI, acute hospitalized, rehabilitation for moderate/ 
severe TBI, or epidemiological research on TBI, the CRFs include Core and Basic CDEs [11]. Depending 
on the particulars of the study, Supplemental CDEs may also be used and are available on the template 
CRFs. Updates to the CDEs are made in collaboration with the research community. 


For Parkison’s disease CDEs are also defined in terms of Core, Basic, and Supplementary. The CDE Project 
provides access to PD CDEs, CRF modules, and guidelines. The CRF modules for PD logically organize the 
CDEs for data collection, while the guidelines provide further information about the CDEs [12]. 


Here, we examine the application of the concept of CWFR for TBI and PD. We will discuss the use of 
CDEs as part of a CWER layer, and provide information on the Biomedical Research Informatics Computing 
System (BRICS) technical stack, and the Protocol and Form Research Management System (ProFoRMS) used 
to develop the Federal Interagency Traumatic Brain Injury Research (FITBIR) and Parkinson’s Disease 
Biomarker Program (PDBP) systems. In addition, we explain the role of BRICS components for enabling 
data objects to be Findable, Accessible, Interoperable, and Reusable (FAIR) [13]. 
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2. CDE-BASED CWER TECHNICAL STACK 


A recurring pattern in both TBI and PD clinical research is hypotheses formulation, data collection, 
analyses, and dissemination. The FAIR principles state that digital stewardship should promote the 
discoverability and reuse of digital objects, which include data, metadata, software, and workflows [13]. 
Furthermore, it posits that data and metadata should be accompanied by Persistent Identifiers, indexed 
in a searchable resource, retrievable by their identifiers, and use vocabularies that meet domain-relevant 
community standards. The principles serve as guidelines for improving data discovery and reuse within any 
infrastructure developed for various research enterprises. 


For disease area research, using CDEs during data collection is a way for Digital Objects (DOs) to be 
reusable as well as to reduce post hoc processing burden during data analyses. Therefore, a CWFR that 
utilizes CDEs plays an important role in enabling research data to be FAIR. 


2.1 Biomedical Research Informatics Computing System 


Building systems to support the CDE-based CWER layer requires an underlying stack comprising technical 
workflow, execution frameworks, and machinery (see Figure 2 of the white paper). Next, we discuss the 
implementation of a CWFR layer with an underlying stack of layers that constitute the BRICS, a service- 
oriented informatics system design that was used in the development of FITBIR and PDBP systems [14]. 


BRICS comprises infrastructure, business, service, and user interface layers, which provide tools and 
applications for assisting and creating information packages for maintaining patient confidentiality, metadata 
association, curation, validation, collection, and subsequent storage within a repository [14]. A variety 
of tools including the Global Unique Identifier (GUID) client, Translational, Validation, Submission, and 
Download tools are provided as part of the presentation layer. Individual modules for various components 
include the Data Dictionary (DD), Account Management, Query Tool, ProFoRMS, Meta Study, Repository 
Manager, and GUID (Figure 1). 


User Interface layer — GUID clients, User management module, Access 


A authorization management, CDE based CWFR 

i s 
Service Layer — BRICS Modules (GUID, DD, ProFoRMS, Image Submission, Data E 

p Query, and Meta Study tools) (8s 

U 

s 

A R 

T Business Layer — Logical interface with infrastructure components for delivering l 

p service options within the service layer F 

o Y 

R Infrastructure Layer — Network components, Hardware and Software (including 

M servers, operating system, applications), Storage Repository, available as local 


and/or Cloud based instances. 


Figure 1. The stack layers for Biomedical Research Informatics Computing System. 
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The implementation of the BRICS design resulted in FITBIR [15] and PDBP [16] that efficiently utilized 
the service modules shown in Figure 1, for recurring actions within the TBI and PD clinical research 
workflows. Both FITBIR and PDBP are supported by the complete stack of layers (shown in Figure 1) that 
is available at the NIH. Cloud computing can also be used as an infrastructure for deploying various types 
of workflows for biomedical research [17]. For example, the National Trauma Institute has deployed a 
cloud-based BRICS instance, the National Trauma Research Repository for supporting new and emerging 
data sharing needs for the trauma research community [18]. 


Both TBI and PD researchers can store data within FITBIR and PDBP repositories for long-term preservation 
and future access [19]. The research community-endorsed CDEs for TBI and PD produce information 
packages (submission, archive, access) that can be reused for further research. The integration of the CWFR 
layer with the BRICS results in disease-specific repositories (e.g., FITBIR and PDBP) that are trustworthy, 
and are consistent with the TRUST principles for digital repositories [20]. 


2.2 Protocol and Form Research Management System 


The BRICS instances for TBI, and PD use the DDs and CDEs with the Protocol and Form Research 
Management System (ProFoRMS). The main function of ProFoRMS is to optimize the clinical study process 
by providing functions to researchers for managing protocols, collecting new clinical data using electronic 
forms, and the submission of collected data for storage in designated repositories [21]. 


The underlying components of the ProFoRMS that support CDE-based CWFR are illustrated below 
(Figure 2). The software tools used to execute the ProFoRMS are shown within the Presentation, Application, 
and Database layers. The ProFoRMS also provides for automatic validation with the data dictionaries of the 
FITBIR and PDBP. Using the layered approach, introducing enhancements, changes in tools, and upgrades 
to improve system performance are facilitated. 


By using ProFoRMS, a library of files that are comma-separated values (CSV) format is produced and 
is structured to be consistent with CDE-variable names and values. In addition to FITBIR and PDBP, the 
ProFoRMS is adaptable for supporting other biomedical research programs. 


3. DEVELOPING FAIR DATA OBJECTS (FDOS) 


A CDE-based CWFR includes Data Dictionaries, Data Elements, Form Structures (FS), and electronic 
forms. The CDEs for a disease area can relate to a question on an electronic form and the responses to the 
questions are made by precisely defined numerical values. The FS serves as the containers for ordered data 
elements, and the electronic forms are developed using FS as their foundation. A list of Data Elements, FS, 
and Core CDEs for TBI is available at the FITBIR site [22] and similar resources are available for PD research 
at the PD site [23]. 
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Figure 2. The ProFoRMS components that support a CDE-based CWFR. 


The FDOs comprise the essential digital bit sequence, descriptive and technical metadata, and a persistent 
identifier schema for producing information package(s) that can be stored, preserved, and accessed over 
time [24]. For TBI and PD research, FDOs are generated for a research study by using the FITBIR and PDBP 
systems. Protecting personally identifiable information within the workflow process is critical during patient 
enrollment and throughout the research study. Therefore, before submission to FITBIR or PDBP repository, 
a canonical step involves the deidentification of patient information. The GUID application program [25] 
supports the deidentification of data. The program results in computer-generated random alphanumeric 
code (GUIDs) unique to each patient and stored within the FITBIR and PDBP repositories. The term “Global” 
in GUID for FITBIR and PDBP systems does not imply the findability of the patient data on the Internet, 
however, authorized researchers can access the GUIDs to find and link together all submitted data for a 
single patient who may have been part of multiple clinical studies. For the data set to be an FDO on the 
web, researchers can organize one or more datasets into a single entity called a ‘Study’ that serves as a 
container for the data and metadata with the associated description and methods used for collecting it. 
Using the DataCite membership available with the FITBIR and PDBP systems, Digital Object Identifiers are 
assigned for studies to cite in journal articles. 
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To ensure the proper formatting of files and the formation of FDOs, the Data Mapping and Transformation 
(DMT) tool [26] is used to translate data received from any user system, for submission to FITBIR and PDBP 
systems. In addition, the tool saves the translated data into a CSV file that can then be processed by the 
Validation tool. 


The DMT tool is operational on Java-enabled platform such as Windows, UNIX, or Macintosh OS X. 
Data collected by another generic data collection system (e.g., RedCap) can also be validated with the 
DMT tool before submission to the FITBIR and PDBP repositories. The FDOs for TBI and PD contain all 
relevant information needed from the time data is collected, processed, and stored within a specifically 
designated repository. Additionally, all related resources from a TBI and PD study can be shared as Research 
Objects (ROs). The ROs are represented as a Meta Study, that includes the results from other studies, and 
which are aggregated for additional analyses. The ROs have assigned DOls and are findable and accessible 
on the Internet. 


4. SUMMARY AND CONCLUSION 


The NIH supports extensive clinical research programs for many diseases. Clinical research has a recurring 
workflow pattern that involves hypotheses generation, experimental design, institutional review board 
approvals, data collection, curation, storage within repositories, analyses, and dissemination of research 
results. Using CDEs and Data Dictionaries during data collection can result in a CWFR layer that is 
supported by an informatics system. For disease area research, the underlying layered informatics technical 
stack for CWFR can comprise the BRICS that process, store, and preserve the data within trusted digital 
FITBIR and PDBP repositories. The integration of a CWFR layer within an informatics system has resulted 
in the reuse of software tools and applications in various NIH biomedical research programs. The Traumatic 
Brain Injury and Parkinson’s Disease FDOs contain all the relevant information needed from the time data 
is first collected to when it is validated and then maintained for future access and use within the designated 
storage repositories. Using the GUIDs, authorized researchers can access and link together all submitted 
data for a single patient who may have participated in multiple clinical studies in one or more disease 
areas. Finally, the use of TBI and PDBP data dictionaries results in DOs that are findable, accessible, and 
reusable for research. 
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